Abstract

Automatically detecting discourse segments is an important preliminary
step towards full discourse parsing. Previous research on discourse
segmentation has relied on the assumption that elementary discourse units
(EDUs) in a document always form a linear sequence (i.e., they can never be
nested). Unfortunately, this assumption turns out to be too strong, for some
theories of discourse, like SDRT, allow for nested discourse units. In this
paper, we present a simple approach to discourse segmentation that is able
to produce nested EDUs. Our approach builds on standard multi-class
classification techniques combined with a simple repairing heuristic that
enforces global coherence. Our system was developed and evaluated on the
first round of annotations provided by the French Annodis project (an
ongoing effort to create a discourse bank for French). Cross-validated on
only 47 documents (1,445 EDUs), our system achieves encouraging performance
results with an F-score of 73% for finding EDUs.



1. Introduction 

Discourse parsing is the analysis of a text from a global, structural
perspective: how parts of a discourse contribute to its global
interpretation, accounting for semantic and pragmatic effects beyond simple
sentence concatenation. This task consists of two main steps: (i) finding
the elementary discourse units (henceforth EDUs), and (ii) organizing them
in a way that makes explicit their functional (aka rhetorical) relations.
Popular theories of discourse include Rhetorical Structure Theory (RST)
(Mann and Thompson, 1987), Discourse Lexicalized Tree-Adjoining Grammar
(DLTAG) (Webber, 2004), and Segmented Discourse Representation Theory (SDRT)
(Asher, 1993). Each of these theoretical frameworks has been at the center
of important corpus building efforts; see (Carlson et al., 2003),
(Prasad et al., 2004), and (Baldridge et al., 2007), respectively. In the
present work, we focus on the first step, namely segmenting a discourse into
EDUs, within a larger project aiming at building an SDRT discourse corpus of
French texts.

In addition to being a necessary step in discourse parsing, discourse
segmentation could also be useful as a stand-alone application for a variety
of other tasks where EDUs could provide simpler input than sentences.
Examples of such tasks are automatic summarization and sentence compression,
bitext alignment, translation, and chunking/syntactic parsing.

The first discourse segmentation system dates back to the rule-based work of
(Ejerhed, 1996), which was a component in the RST-based parser of
(Marcu, 2000). More recently, (Tofiloski et al., 2009) tested a rule-based
segmenter on top of a syntactic parser, achieving an F-score of 80-85% in
segment boundary identification on a slightly modified RST corpus. Machine
learning based segmentation systems have also been proposed, notably by
(Soricut and Marcu, 2003), (Sporleder and Lapata, 2005) and
(Fisher and Roark, 2007). The latter report an F-score of 90.5% in boundary
detection (and 85.3% in correct bracketing) on the RST corpus.
Discourse segmentation is to a large extent theory-dependent, for different
theories make different assumptions on what EDUs can be. Carried out on the
RST corpus, previous work on discourse segmentation has exploited an
important particularity of this corpus: namely, the fact that it does not
have any embedded EDUs. These approaches have been able to recast discourse
segmentation as a binary classification problem: that is, each text position
(token or token separator) is either a segment boundary or not. In contrast
to RST, other theories like SDRT allow for embedded EDUs: embedding is used
to encode modifying clauses like non-restrictive relatives (including
reduced relatives) and appositions. As will be discussed in Section 2, our
SDRT-based corpus does contain close to 10% of nested EDUs.
Predicting nested structures introduces additional difficulties, in
particular that of outputting a coherent, balanced bracketing. This
characteristic renders discourse segmentation akin to syntactic clause
boundary identification (CBI), a task which has received some attention from
the CL community. The main approach to CBI is to classify tokens into three
classes for clause start, end, or inside. The best results obtained during
the CoNLL-2001 campaign were 89-90% for boundary detection and 81.73% for
correct clause identification (correct guessing of start and end), with
boosted decision trees (Carreras and Marquez, 2001).
We have adapted this general setting to the problem of discourse
segmentation, with possible embedded segments, and applied it to a corpus of
French discourses, part of an on-going corpus building project.

2. Data and Evaluation 
2.1. Corpus 

The corpus we use has been developed as part of the Annodis project,^1 an
on-going effort to annotate French discourses from various genres with both
top-level, typographic structures and local coherence relations. About
100-150 texts are being segmented and annotated with coherence relations.
These documents are drawn mainly from Wikipedia articles and from the
L'Est Républicain newspaper.^2 Text length varies from 300 to 900 tokens.
Annotations are performed by pairs of human annotators in a two-step
process: (i) individual annotations, and (ii) adjudication. The present work
considers the 47 texts that have undergone validation. The average number of
EDUs per document in this set is 33.

Segments typically correspond to verbal clauses, but also to other syntactic
units describing eventualities (such as prepositional phrases), and to
adjuncts such as appositions or cleft constructions with discursive
long-range effects such as frame adverbials. A particularity of the
discourse units in Annodis is that they can be embedded in one another, as
in the example in figure 1 (brackets mark segmentation).

In this example, the EDUs π1 mondialement connues, and π2 donc difficilement
écoulables, are nested within the main, discontinuous EDU π0 Ces pièces
avaient été repérées chez un riche amateur nippon.

2.2. Evaluation 

Discourse segmentation evaluation is typically performed in terms of
precision, recall, and F-score for segment boundaries
(Soricut and Marcu, 2003; Fisher and Roark, 2007;
Sporleder and Lapata, 2005). Previous work differs as to whether it includes
sentence boundaries (e.g., (Soricut and Marcu, 2003) are only interested in
sentence-internal segmentation) and whether it additionally requires
labeling of the segments (Sporleder and Lapata, 2005).

Since the type of segmentation we produce in- 
cludes nested EDUs, we have to resort to an- 
other type of evaluation. For this paper, we use 
the three metrics commonly used for evaluating 
clause detection: (i) precision, recall, and F-score 
for segment start position, (ii) precision, recall, 
and F-score for segment end position, and (iii) 
precision, recall, and F-score for complete seg- 
ments. These metrics correspond to three tasks 
included in the CoNLL 2001 shared task. 
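
As an illustration of these three metrics, the following minimal sketch
scores start positions, end positions, and complete segments. It is not the
CoNLL 2001 scorer: the representation of a segment as a (start, end) pair of
token indices and the function names prf and evaluate are assumptions made
for this example only.

```python
def prf(gold, pred):
    """Precision, recall and F-score between two collections of items."""
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_score


def evaluate(gold_segments, pred_segments):
    """Score start positions, end positions and complete segments.

    Each segment is a (start, end) pair of token indices (assumed encoding).
    A complete segment counts as correct only if both ends match.
    """
    return {
        "start": prf({s for s, _ in gold_segments}, {s for s, _ in pred_segments}),
        "end": prf({e for _, e in gold_segments}, {e for _, e in pred_segments}),
        "segment": prf(gold_segments, pred_segments),
    }


# Hypothetical example: one predicted segment ends one token too late.
gold = [(0, 9), (2, 3), (4, 5)]
pred = [(0, 9), (2, 3), (4, 6)]
scores = evaluate(gold, pred)
```

In this toy case all start positions are found, while one wrong end position
lowers both the end score and the complete-segment score, which shows why the
segment metric is the strictest of the three.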



^1,2 http://w3.erss.univ-tlse2.fr/textes/[illegible] ; [illegible]/corpus/estrepublicain



[Ces pièces, [mondialement connues,]_π1 [donc difficilement écoulables,]_π2
avaient été repérées chez un riche amateur nippon]_π0

[The pieces, [worldwide famous,]_π1 [thus hard to resell,]_π2 had been
located at a rich Japanese art lover's]_π0

Figure 1: A discourse segmentation from the Annodis corpus.



3. Approach 

3.1. Classification Model 

Like previous approaches to discourse segmentation and CBI, we cast the task
of EDU identification as a classification problem. Specifically, we built a
four-class classifier that maps each token w_i in a discourse w_1, ..., w_n
to one of the following boundary types B = {left, right, both, nothing}.
These correspond to the different bracketing configurations found in our
corpus, respectively: (i) w_i opens a segment, (ii) w_i ends a segment,
(iii) w_i is a single-token segment, and (iv) none of the above. If we take
the beginning of the example in Section 2.1, [Ces pièces, [mondialement
connues,], Ces and mondialement would be classified as left, the last comma
as right, and all other tokens as nothing.
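
The mapping from a bracketed segmentation to the four boundary types can be
sketched as follows. This is only an illustration under stated assumptions:
segments are given as inclusive (start, end) token indices, punctuation marks
are treated as separate tokens, and the function name boundary_labels is
hypothetical. A token that opens one segment and closes a different one is
labeled both here, which is a simplification of the single-token case
described above.

```python
def boundary_labels(n_tokens, segments):
    """Map (start, end) segments (inclusive indices) to per-token labels."""
    starts = {start for start, _ in segments}
    ends = {end for _, end in segments}
    labels = []
    for i in range(n_tokens):
        if i in starts and i in ends:
            labels.append("both")      # opens and closes at the same token
        elif i in starts:
            labels.append("left")      # opens a segment
        elif i in ends:
            labels.append("right")     # closes a segment
        else:
            labels.append("nothing")   # no boundary at this token
    return labels


# The sentence of figure 1, with an assumed tokenization.
tokens = ["Ces", "pièces", ",", "mondialement", "connues", ",",
          "donc", "difficilement", "écoulables", ",",
          "avaient", "été", "repérées", "chez", "un", "riche",
          "amateur", "nippon"]
# Outer EDU, then the two nested EDUs.
segments = [(0, 17), (3, 5), (6, 9)]
labels = boundary_labels(len(tokens), segments)
```

With this encoding, Ces and mondialement receive left and the comma after
connues receives right, matching the description in the text.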

For our classifier, we used a regularized maximum entropy (MaxEnt, for
short) model (Berger et al., 1996). In MaxEnt, the parameters of an
exponential model of the following form are estimated:

$$P(b \mid t) = \frac{1}{Z(t)} \exp\Big( \sum_{i=1}^{m} w_i f_i(t, b) \Big)$$

where t represents the current token and b the outcome (i.e., the type of
boundary). Each token t is encoded as a vector of m indicator features f_i.
There is one weight/parameter w_i for each feature f_i that predicts its
classification behavior. Finally, Z(t) is a normalization factor, obtained
by summing over the different class labels (in this case, the 4 boundary
types), which guarantees that the model outputs probabilities.
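
To make the model form concrete, here is a minimal sketch of how such a
distribution over the four boundary types can be computed for one token. It
is not the MegaM implementation: the feature names and weight values are
invented toy values, and storing the weights in a dictionary keyed by
(feature, label) pairs is an assumption made for the example.

```python
import math

LABELS = ("left", "right", "both", "nothing")


def maxent_probs(active_features, weights):
    """Return P(b | t) for every boundary type b.

    active_features: names of the indicator features that fire on token t.
    weights: dict mapping (feature, label) to a weight; missing pairs are 0.
    """
    scores = {
        b: sum(weights.get((f, b), 0.0) for f in active_features)
        for b in LABELS
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {b: math.exp(s - top) for b, s in scores.items()}
    z = sum(exps.values())  # normalization over the four labels
    return {b: e / z for b, e in exps.items()}


# Toy weights (hypothetical values, not learned from the Annodis data).
toy_weights = {
    ("pos=PRO_REL", "left"): 2.0,
    ("pos=PRO_REL", "nothing"): 0.5,
}
probs = maxent_probs(["pos=PRO_REL", "chunk_start"], toy_weights)
```

Because the normalization runs over the labels for a fixed token, the four
values always sum to one; with no active weights the distribution is uniform.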

In MaxEnt, the values for the different parameters w are obtained by
maximizing the log-likelihood of the training data T with respect to the
model (Berger et al., 1996):

$$\hat{w} = \arg\max_{w} \sum_{(t, b) \in T} \log P(b \mid t)$$

Various algorithms have been proposed for performing parameter estimation
(see (Malouf, 2002) for a comparison). Here, we used the Limited Memory
Variable Metric Algorithm implemented in the MegaM package.^3 We used
MegaM's default regularization prior.

3.2. Feature Set 

Our feature set relies on two main sources of in- 
formation. The first source is a list of lexical 
markers, containing discourse connectives and a 
few indirect speech report verbs that are likely to 
introduce discourse units. Specifically, we cre- 
ated boolean features that check whether the to- 
ken is part of connectives (resp. verbs) in our list 
of markers. 

The other information source is (morpho-)syntactic, drawn from the automatic
analysis provided by the Macaon chunker (Nasr and Volanschi, 2006) and the
Syntex dependency parser (Bourigault et al., 2005).
Using these two analyzers, we extract for each token: its lemma, its
part-of-speech (POS) tag, its chunk tag, its dependency path to the root
element (as well as "sub-paths" of length 1-3), and its inbound
dependencies. In addition, we also capture the linear position of the word
in a sentence (we used quantized values ranging from 1 to 100). These
feature templates were also applied to the surrounding words in a window of
3 words to the left and right.
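
The window templates and the quantized position can be sketched as follows.
This is an illustrative reading of the description above, not the feature
extractor actually used: the token encoding (a dictionary with lemma, pos and
chunk keys), the feature-string format, and the exact quantization formula
are all assumptions.

```python
def window_features(tokens, i, size=3):
    """Lemma, POS and chunk features for token i and its neighbours.

    tokens: list of dicts with the keys 'lemma', 'pos' and 'chunk'
    (an assumed encoding of the Macaon and Syntex output).
    """
    feats = []
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):  # stay inside the sentence
            for key in ("lemma", "pos", "chunk"):
                feats.append(f"{key}[{offset:+d}]={tokens[j][key]}")
    return feats


def quantized_position(i, sentence_length, bins=100):
    """Map a 0-based token index to a bin between 1 and bins (assumed scheme)."""
    return 1 + (i * bins) // sentence_length


sentence = [
    {"lemma": "le", "pos": "DET", "chunk": "B-NP"},
    {"lemma": "pièce", "pos": "N", "chunk": "I-NP"},
    {"lemma": "avoir", "pos": "V", "chunk": "B-VP"},
]
feats = window_features(sentence, 0, size=1)
```

Encoding the relative offset in the feature name keeps, for instance, the POS
of the previous word distinct from the POS of the next word.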
Two more feature families were added. The first concerns the outward chunk
sequence for each token; that is, given that a token is embedded in a
sequence of chunks, we start from the innermost chunk tag and go all the way
out to the outermost. These features exploit the fact that Macaon provides
some level of embedding in its chunks. The second feature family concerns
all the n-grams (1 < n < 6) that include the token and whose span does not
exceed the boundaries of the current sentence. A synoptic table with the
entire feature set we used is shown in table 1.

^3 Available from http://www.cs.utah.edu/~hal/meg
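
The n-gram feature family can be sketched as below. This is a minimal
illustration, assuming that the bound 1 < n < 6 is read strictly (n from 2 to
5) and that each n-gram is recorded together with its offset relative to the
token; the function name ngrams_containing and both defaults are
hypothetical and can be changed if the intended bounds are inclusive.

```python
def ngrams_containing(values, i, n_min=2, n_max=5):
    """All n-grams over `values` that include position i.

    values: the per-token sequence for one sentence (lemmas, POS tags or
    chunk tags). N-grams never cross the sentence boundaries.
    Returns (relative_start, ngram) pairs.
    """
    feats = []
    for n in range(n_min, n_max + 1):
        # An n-gram containing i can start anywhere from i - n + 1 to i.
        for start in range(i - n + 1, i + 1):
            end = start + n
            if start >= 0 and end <= len(values):
                feats.append((start - i, tuple(values[start:end])))
    return feats


pos_tags = ["DET", "N", "V", "ADV"]
bigrams = ngrams_containing(pos_tags, 1, n_min=2, n_max=2)
all_ngrams = ngrams_containing(pos_tags, 1)
```

Tokens near a sentence edge receive fewer such features, since n-grams that
would run past the first or last token are simply not generated.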

3.3. Resampling 

The distribution of boundary types is heavily skewed towards nothing (about
12,000 instances against about 1,400 each for left and right), which
suggests that resampling our data toward a more uniform distribution might
lead to better classification accuracy, and in turn to better EDU
segmentation.

The resampling method we used directly exploits the syntactic chunk
boundaries as found by the Macaon chunker. It is based on the observation
that EDU boundaries in a large majority of cases coincide with chunk
boundaries. The output of Macaon was used in the following ways. First, we
decided to replace the decisions on sentence boundary tokens with the
decisions that Macaon provides. In other words, sentence boundary tokens, as
given by Macaon, were ignored during training; at test time, they were
tagged as left and right respectively. Second, we also removed from training
the tokens that were strictly inside chunks (that is, tokens that are inside
a chunk but do not correspond to its beginning or end). At test time, these
tokens were assigned the nothing class. All remaining tokens were used for
training and were labeled by the classifier at test time. After those
modifications, the class distribution was around 9,200 instances for the
class nothing, while the other classes had around 1,400 instances.
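
The routing of tokens described above can be sketched as follows. This is a
minimal illustration, not the code used for the experiments: the token
encoding (a dictionary with four boolean flags derived from the chunker
output) and the function names are assumptions, and every token is assumed
to belong to some chunk.

```python
def fixed_label(token):
    """Label imposed by the chunker output, or None if the classifier decides.

    token: dict with boolean keys 'sent_start', 'sent_end', 'chunk_start'
    and 'chunk_end' (an assumed encoding).
    """
    if token["sent_start"]:
        return "left"
    if token["sent_end"]:
        return "right"
    if not (token["chunk_start"] or token["chunk_end"]):
        return "nothing"  # strictly inside a chunk
    return None


def training_instances(sentence):
    """Keep only the tokens whose label is actually learned."""
    return [tok for tok in sentence if fixed_label(tok) is None]


def decode(sentence, classify):
    """Use the fixed label when there is one, the classifier otherwise."""
    return [fixed_label(tok) or classify(tok) for tok in sentence]
```

Only tokens on a chunk edge that are not sentence boundaries reach the
classifier, which is what reduces the number of nothing instances seen in
training.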

3.4. Enforcing coherence 

Casting segmentation as a series of local classifications has two major
drawbacks. First, it ignores the fact that the segmentation decision at a
token is highly dependent on the decisions on neighboring tokens. Secondly,
unrelated local decisions do not guarantee the well-formedness of the
segmentation of a sentence, since we allow for embedded segments. For
instance, the number of beginnings of embedded segments must obviously match
the number of endings.

A straightforward way to capture Markovian dependencies between segmentation
labels is to encode previous labels as features of the model, in combination
with Viterbi decoding. Unfortunately, we found during development that this
strategy degrades segmentation performance, probably due to the sparsity of
the boundary labels.^4

To tackle the problem of ensuring a coherent bracketing, we propose a
specific post-processing step on the outputs of the classifier. In
particular, we apply heuristic repair techniques (adding/deleting
boundaries) to yield a well-formed sentence segmentation. A simple technique
proved efficient enough: we scanned sentences token by token from beginning
to end, while keeping track of the depth of the current EDU embedding. If
the depth is 0 before the end of a sentence, it means we found a stranded
token, which is then reclassified as left; this rebalances the number of
left and right labels. Dually, we reversed the sequences to reclassify
remaining out-of-segment tokens as right. This heuristic is illustrated in
figure 2.
In the future we plan to apply local optimization techniques under
well-formedness constraints, to repair segmentations while better preserving
the probability of each decision.
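
One possible reading of this two-pass repair is sketched below. It is a
hedged illustration rather than the exact heuristic used in the experiments:
in particular, turning a stranded right (or left) label into both is an
assumption of this sketch, and the function names are hypothetical.

```python
def repair(labels):
    """Two-pass repair of a sentence's boundary labels.

    Forward pass: a token found at depth 0 that does not open a segment is
    stranded and is made to open one. Backward pass: the same idea on the
    reversed sentence, adding the missing closing boundaries.
    """
    labels = list(labels)
    depth = 0
    for i, lab in enumerate(labels):
        if depth == 0 and lab in ("nothing", "right"):
            lab = "left" if lab == "nothing" else "both"
            labels[i] = lab
        if lab == "left":
            depth += 1
        elif lab == "right":
            depth -= 1
    depth = 0
    for i in range(len(labels) - 1, -1, -1):
        lab = labels[i]
        if depth == 0 and lab in ("nothing", "left"):
            lab = "right" if lab == "nothing" else "both"
            labels[i] = lab
        if lab == "right":
            depth += 1
        elif lab == "left":
            depth -= 1
    return labels


def is_well_formed(labels):
    """True if brackets are balanced and the depth never drops below zero."""
    depth = 0
    for lab in labels:
        if lab == "left":
            depth += 1
        elif lab == "right":
            depth -= 1
        if depth < 0:
            return False
    return depth == 0


# Classifier output for the English gloss of figure 2 (assumed tokenization):
# The pieces, worldwide famous, thus hard to resell, had been located
# at a rich Japanese art lover's
predicted = ["left", "right", "nothing", "right", "nothing", "nothing",
             "nothing", "right", "nothing", "nothing", "nothing", "left",
             "nothing", "nothing", "nothing", "nothing", "right"]
repaired = repair(predicted)
```

On this input the forward pass opens a segment at worldwide, thus and had,
and the backward pass closes one at located, as in figure 2. Like the
heuristic in the text, the sketch restores balance but does not recover the
nesting of the reference segmentation.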

4. Experiments and Results 

We present two sets of scores, one without post-processing and one with
post-processing. We did a 10-fold cross-validation on the sentences
contained in the 47 documents of the corpus. We used the three metrics for
segmentation evaluation discussed in section 2.2; we also report precision,
recall, and F-score for the both boundary class.
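
A sentence-level 10-fold split of the kind mentioned above can be sketched as
follows. How the sentences were actually assigned to folds is not stated, so
the round-robin assignment used here, like the function name k_fold_splits,
is an assumption.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for held_out in range(k):
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, folds[held_out]


# Hypothetical usage with sentence identifiers standing in for sentences.
sentences = list(range(25))
splits = list(k_fold_splits(sentences, k=10))
```

Scores are then computed on each held-out fold with a model trained on the
other nine folds, and aggregated over the ten runs.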

Table 2 (resp. table 3) reports the performance scores of the
"classifier-only" system (resp. "classifier+post-processing" system) for the
first series of experiments. In terms of overall classification performance,
both systems perform similarly, but the second system improves on the three
boundary classes {left, right, both}. The main source of improvement comes
from recall, which suggests that our heuristics recover boundaries that were
missed by the classifier.
Before post-processing, the proportion of ill-formed segmentations on the
(recognized) sentences is 35%; our post-processing heuristics yield 98%
well-formed segmentations. The impact on precision/recall is shown in
table 3.

^4 Similar findings are reported by (Fisher and Roark, 2007).

Feature                            Description
Lemma                              the token's lemma (Syntex)
POS                                part of speech (Macaon)
Grammatical category               the main grammatical category of the
                                   token: V, N, P, etc. (Syntex)
start of a discourse marker        boolean, indicating whether the token
                                   starts a discourse marker
indirect speech report verb        boolean, indicating whether the token
                                   belongs to a predefined list of verbs
dependency path                    the dependency path from the word towards
                                   the root, limited to distance 3 (Syntex)
inbound dependencies               the inbound dependency relations for each
                                   token (Syntex)
syntactic projections              the number of times that the token is at
                                   the start, end or middle of an NP, VP, PP
                                   projection (Syntex)
distance from sentence boundaries  the relative distance from each of the
                                   sentence boundaries
context 3-grams                    the lemma and POS 3-grams before and after
                                   the token (Syntex & Macaon)
chunk start/end                    boolean features; token coincides with a
                                   chunk start/end (Macaon)
outward chunk tag sequence         the sequence of chunk tags from the
                                   innermost to the outermost chunk (Macaon)
context n-grams                    all the n-grams (1 < n < 6) that include
                                   the token and do not exceed the limits of
                                   the sentence. The n-grams include lemmas
                                   (Syntex), POS tags (Macaon) and chunk tags
                                   (Macaon)

Table 1: Features used for the second approach (including chunks).

Input from classifier:
[The pieces,] worldwide famous,] thus hard to resell,] had been located
[at a rich Japanese art lover's]

First pass left-to-right:
[The pieces,] [worldwide famous,] [thus hard to resell,] [had been located
[at a rich Japanese art lover's]

Second pass right-to-left:
[The pieces,] [worldwide famous,] [thus hard to resell,] [had been located]
[at a rich Japanese art lover's]

Figure 2: Example repairing of a not well-formed segmentation with additions
underlined. The sentence can now be compared to the reference, cf. figure 1.

The overall poor performance on the both class is due to the lack of data
for this class: there are fewer than 20 examples in the entire corpus. When
it comes to the segment evaluation, again the best results were achieved by
the second approach, which managed to correctly identify 73% of the manually
annotated segments. These results are slightly lower than, but close to, the
best results obtained by systems on the CBI task. Of course, the main reason
post-processing boosts the EDU score is that a third more of the sentences
are now evaluated, since they are well-formed. But the decline in precision
is much smaller than the gain in recall.



4.1. Learning Curve 

For their RST EDU segmentation experiments, Fisher and Roark (2007) have
been using the RST-DT corpus, which consists of a total of 385 documents
(176,000 tokens).



Class    Recall    Precision    F-measure
Left     0.845     0.891        0.868
Right    0.881     0.925        0.902
Both     0.684     0.812        0.742
EDUs     0.427     0.880        0.575

Table 2: Evaluation without post-processing.



Class    Recall    Precision    F-measure
Left     0.876     0.880        0.878
Right    0.885     0.889        0.888
Both     0.684     1.0          0.812
EDUs     0.719     0.748        0.733

Table 3: Evaluation with post-processing.



Carreras and Marquez (2001) have used the CoNLL 2001 corpus for the task of
clause boundary identification: this corpus includes sections 15-18 of the
Penn Treebank for training (211,727 tokens) and section 20 for test (47,377
tokens). In contrast to those approaches, as mentioned in section 2, we have
been working with 47 validated documents (14,384 tokens) from the Annodis
project. Given that the number of documents that we have been working with
is limited, at least in comparison with other approaches, we have calculated
the learning curve for this number of documents in order to understand how
the learning procedure will be influenced once we have the totality of our
documents annotated. As mentioned in section 2, the total number of
documents expected will be in the range of 100 to 150.
In order to calculate our learning curve, we divided our corpus into 9
different learning sets, starting from 5 random documents and incrementally
adding 5 random documents into each learning set. For each such set we
performed a ten-fold cross-validation, in the same way as described in
section 4, using the feature set shown in table 1. The learning curve is
shown in figure 3. As can be seen from this figure, the curves for both
classes (left and right) grow regularly between 5 and 30 documents, seem to
plateau between 30 and 40 documents, and start going up again for the last
set of documents. In general, it seems that the addition of more documents
will only slightly increase the performance of our approach.
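
The construction of the nested learning sets can be sketched as follows. This
is a minimal illustration of the procedure described above; the function
name learning_sets, the use of a fixed random seed, and the document
identifiers are assumptions.

```python
import random


def learning_sets(documents, step=5, n_sets=9, seed=0):
    """Nested learning sets of 5, 10, ..., 45 randomly chosen documents."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(documents)
    rng.shuffle(shuffled)
    # Each set extends the previous one with `step` further documents.
    return [shuffled[: step * k] for k in range(1, n_sets + 1)]


# Hypothetical document identifiers for the 47 validated texts.
sets = learning_sets([f"doc{i:02d}" for i in range(47)])
```

Since each set contains the previous one, differences between consecutive
points on the curve reflect only the 5 added documents, not a complete
reshuffling of the training data.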

5. Conclusions and Future Work 

Discourse segmentation is a crucial preprocessing stage for discourse
analysis, and the global reliability of discourse parsing is heavily
determined by success at this level. We have proposed a simple approach
combining a four-class classifier with a post-processing heuristic that
achieves reasonable results, although the data available at the moment is
limited. We need to see how this generalizes to the whole corpus, and to
check how dependent it is on the nature of the corpus (newspaper articles
and encyclopedia articles). Another angle we plan to investigate is the
usefulness of a non-perfect segmentation to help annotators start discourse
annotation. Given the cost of human annotation of discourse, saving time on
the segmentation would be a boost to annotators' productivity, provided we
verify that the time spent is roughly proportional to the number of errors
in the automated preprocessing; that hypothesis is not necessarily true, and
there might be a threshold on the precision of the processing that is
acceptable. Mainly, the ideal trade-off between precision and recall remains
to be investigated.